# rl_queueing

This repository contains single-run evaluations of the different algorithms from the paper "Revisiting Familiar Places in an Infinite World: Continuing RL in Unbounded State Spaces".

Dependencies:
1. torch 1.13.0
2. stable-baselines3 2.0.0a0
3. gym 0.26.2
4. gymnasium 0.28.1
5. sumo_rl: https://github.com/LucasAlegre/sumo-rl
6. libsumo: https://sumo.dlr.de/docs/Libsumo.html
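
Before launching a run, it can help to confirm that the installed packages match the versions listed above. The following is a minimal sketch, not part of this repository: the distribution names (for example `stable-baselines3`) are assumed to be the PyPI names, and `EXPECTED`, `installed_version`, and `find_mismatches` are hypothetical helpers introduced only for illustration.

```python
from importlib import metadata

# Pins copied from the dependency list above. The keys are assumed to be the
# PyPI distribution names; adjust them if your installation differs.
EXPECTED = {
    "torch": "1.13.0",
    "stable-baselines3": "2.0.0a0",
    "gym": "0.26.2",
    "gymnasium": "0.28.1",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Map each package whose version differs from its pin to (expected, found)."""
    mismatches = {}
    for name, wanted in expected.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

The comparison is an exact string match, so a build with a local version suffix (such as a CUDA-specific torch wheel) would be reported as a mismatch even if the base version agrees.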


Files/directories:
- algos: contains average-reward PPO, TRPO, and DQN, implemented by modifying the replay and rollout buffer files of stable-baselines3
- policies.py contains wrapper classes
- utils.py contains supporting functions
- run_single_continual.py and run_traffic.py are the main launch files (see below)
- server_allocation.py is the environment code for single-server queueing and gridworld
- sumo: contains all sumo_rl related code


```
python run_single_continual.py  --outfile <fname> --env_name <env> --mdp_num <mdp_num> --deployed_interaction_steps 5000000  --exp_name stoch  --reward_function <rew_func> --seed 0  --truncated_horizon 512 --algo_name PPO-AR --batch_size 32 --replay_epochs 10 --lr 3e-4 --state_transformation symloge --opt_warmup_time <tau> --opt_beta <beta>
```
where `<fname>` is the file in which to save the result; `<env>` is either `gridworld` or `queue`; `<mdp_num>` is `1` if `<env>` is `gridworld` and `2` or `3` if `<env>` is `queue`; `<rew_func>` is `avg-q-len` (optimality), `avg-q-len-change` (stability), or `mix-avg-q-len-change` (STOP); and `<tau>` and `<beta>` dictate how optimality is introduced.
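
The constraints on `<env>`, `<mdp_num>`, and `<rew_func>` described above can be checked before a long run is started. The sketch below assembles the same command from Python and rejects invalid combinations. It uses only the flags shown in the command above; `build_command` is a hypothetical helper that is not part of this repository, and the output name and the `tau`/`beta` values in the example are placeholders, not recommended settings.

```python
import shlex

# Allowed <mdp_num> values per <env>, taken from the description above.
VALID_MDP_NUMS = {"gridworld": {1}, "queue": {2, 3}}
# Allowed <rew_func> values, taken from the description above.
REWARD_FUNCTIONS = {"avg-q-len", "avg-q-len-change", "mix-avg-q-len-change"}


def build_command(outfile, env_name, mdp_num, reward_function, tau, beta):
    """Return the argument list for one run_single_continual.py launch."""
    if env_name not in VALID_MDP_NUMS:
        raise ValueError(f"unknown env_name: {env_name!r}")
    if mdp_num not in VALID_MDP_NUMS[env_name]:
        raise ValueError(f"mdp_num {mdp_num} is not valid for {env_name!r}")
    if reward_function not in REWARD_FUNCTIONS:
        raise ValueError(f"unknown reward_function: {reward_function!r}")
    return [
        "python", "run_single_continual.py",
        "--outfile", str(outfile),
        "--env_name", env_name,
        "--mdp_num", str(mdp_num),
        "--deployed_interaction_steps", "5000000",
        "--exp_name", "stoch",
        "--reward_function", reward_function,
        "--seed", "0",
        "--truncated_horizon", "512",
        "--algo_name", "PPO-AR",
        "--batch_size", "32",
        "--replay_epochs", "10",
        "--lr", "3e-4",
        "--state_transformation", "symloge",
        "--opt_warmup_time", str(tau),
        "--opt_beta", str(beta),
    ]


if __name__ == "__main__":
    # Placeholder output name and tau/beta values, for illustration only.
    cmd = build_command("result.out", "queue", 2, "mix-avg-q-len-change", tau=1000, beta=0.5)
    print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` to launch the run from a script.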


```
python run_traffic.py  --outfile <fname> --env_name traffic --mdp_num <mdp_num> --deployed_interaction_steps 500000  --exp_name stoch  --reward_function <rew_func> --seed 0  --truncated_horizon 512 --algo_name PPO-AR --batch_size 32 --replay_epochs 10 --lr 3e-4 --state_transformation symloge --opt_warmup_time <tau> --opt_beta <beta>
```
where `<fname>` is the file in which to save the result; `--env_name` is fixed to `traffic`; `<mdp_num>` is `0`, `1`, or `2`, for medium, heavy, and very heavy traffic respectively; `<rew_func>` is `waiting-time` (optimality), `diff-waiting-time` (stability), or `mix-diff-waiting-time` (STOP); and `<tau>` and `<beta>` dictate how optimality is introduced.
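
Both commands pass `--state_transformation symloge`. This README does not define that option. The sketch below assumes it refers to a symmetric natural-log transform, `sign(x) * ln(1 + |x|)`, which compresses large state values such as long queue lengths while preserving sign and order. The functions `symlog` and `symlog_state` are hypothetical names; the repository's actual transformation may differ.

```python
import math


def symlog(x):
    """Symmetric natural log: sign(x) * ln(1 + |x|). Assumed form, see note above."""
    return math.copysign(math.log1p(abs(x)), x)


def symlog_state(state):
    """Apply symlog element-wise to a state given as a sequence of numbers."""
    return [symlog(v) for v in state]


if __name__ == "__main__":
    # Large values are compressed; zero stays zero and signs are preserved.
    for value in (0, 1, 10, 1000, -1000):
        print(value, symlog(value))
```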